
# Replication File for "Sibling Differences in Genetic Propensity for
Education: How do Parents React?" By Anna Sanz-de-Galdeano and Anastasia Terskaya

Date: 16 of March, 2023

Author: Anastasia Terskaya


Operating systems:  Microsoft Windows 10 Enterprise and Linux Pop!_OS 22.04 LTS

Relative paths are provided in Environment.do

## Data Instructions

1. Apply for the Add Health Restricted-Use Data using CPC Data Portal https://data.cpc.unc.edu/projects/2/view

    Data needs to include Genetic Files (Polygenic Index Inventories) and dbGaP Linkage File

2. Apply for dbGap Data for the Add Health (dbGaP Study Accession: phs001367.v1.p1) using https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001367.v1.p1
  
    The dbGap file that we use is "phg001099.v1.AddHealth.genotype-imputed-data.c1.GRU-IRB-PUB-GSO.set2.tar.gz". This file contains genetic matrices with imputed genotype data. It includes only European ancestry sample. 

3. Apply for the 23andMe GWAS results, which are used to construct weights for the polygenic indexes. Use https://research.23andme.com/dataset-access/

4. After signing Data Use Agreement with 23andMe, request that the SSGAC send the complete set of weights for constructing the polygenic scores from Lee et al. (2018) that includes 23andMe data and exclude AddHealth from the discovery sample. Use information from https://www.thessgac.org/pgi-repository.

5. Download publicly available weights for the cognitive performance polygenic score from the SSGAC for Becker et al. (2021) using https://thessgac.com/papers/13



## Imputation of Parental Genes: 

1. Prepare the original data. 

We will work in $Plink
Inside this folder create folders: 

- AHimputed
- AHimputed_bed
- AHimputed_plink
- AH_final
- AH_EA


1.1. $Plink folder needs to have plink (version from May 2021), plink2 (version from September 2021), and king (version from May 2021) for Windows and Linux programs inside. 

Programs are available at: 

https://www.cog-genomics.org/plink/

https://www.kingrelatedness.com/Download.shtml


1.2. Copy the original dbGAP data to $Plink/AHimputed (see Data Download (2))

1.3. Run **vcf2plink_v2.bat** program inside $Plink subfolder in Windows cmd. The final data will be in $Plink/AHimputed_bed. This file converts the data into bed format.  

1.4. Download weights from the Repository Becker (2021) to the folder $Weights_in (see Data Download (5)):

- "EA-Lee_public_LDpred_weights " EA public weights 
- "CP1_single_LDpred_weights" CP weights
- "AH_EA-single_weights_LDpred_p1" EA weights with 23andMe provided by SSGAC after signing agreement with 23andMe (not publicly available). 

Run the STATA code that prepares a subset of SNPs IDs that we will use. This code is in $do\ImputationParents\Weights_prepare.do.
The code produces file "EALeeSNP" with the list of SNP ids that will be used and weights for polygenic scores construction: 

- "EA_Lee_public_weights"
- "CP_Repo_weights"
- "EA_Repo_weights"

1.5. Run **pgen2bed.bat** program inside $Plink subfolder in Windows cmd. The final data will be in $Plink/AH_final. This program extracts a subset of SNPs that will be used (from EALeeSNP file) and saves them in bed plink format supported by **KING** software. 


2.Merge the data across chromosomes. Run in Windows cmd inside $Plink folder:

    plink --merge-list AH_list.txt --make-bed --out AH_final\AH_EA_chr1_22

Here "AH_list.txt" is the list of all .bed files inside the AH_final folder. 

3.Prepare IBD segments using KING software. On Linux inside the $Plink folder run relate_script_pooled. In Linux terminal, inside the folder type:

    source relate_script_pooled.sh

4.Prepare program and data for imputation

4.1. Copy SNIPar package to $SNIPar (https://github.com/AlexTISYoung/snipar). The package was downloaded in November 2021. 

4.2. Inside of $SNIPar create folders $SNIPar_in

4.3. Now copy or move folder $Plink/AH_final to $SNIPar_in 

5.Prepare Add Health Data: 

    In $do/AnalysisMain run: 
    AH_MERGE.do
    DATA_GENES_BOTH.do

6.Prepare pedigrees. Run STATA code that prepares pedigrees 

    $do/ImputationParents/Prepare_pedigrees.do

The obtained pedigrees are in $SNIPar_in

7.Now in $SNIPar in Python run 

    impute_runner_loop_ASIA.py

The produced output will be in folder $SNIPar_out (which is $SNIpar/ASIA/data/output2 )
This code imputes parental genes

8.Generate parental polygenic scores. Run in $SNIPar in Anaconda terminal.

Enter to the environment:

    conda activate snip-env  

Educational attainment PGI with 23andME:

    python fPGS.py ASIA/data/output_v2/PGS_EA_repo --bedfiles ASIA/data/AH_final/AH_EA_chr~  --impfiles ASIA/data/output_v2/out_chr~ --weights  ASIA/data/EA_Repo_weights.txt

If you want to use public weights for EA PGI: 

    python fPGS.py ASIA/data/output_v2/PGS_EA_pub --bedfiles ASIA/data/AH_final/AH_EA_chr~  --impfiles ASIA/data/output_v2/out_chr~ --weights  ASIA/data/EA_Lee_public_weights.txt

For cognitive performance PGI: 

    python fPGS.py ASIA/data/output_v2/PGS_CP_repo --bedfiles ASIA/data/AH_final/AH_EA_chr~  --impfiles ASIA/data/output_v2/out_chr~ --weights  ASIA/data/CP_Repo_weights.txt


## Measurement Error Correction

The ado file for the measurement error correction is $ado/merrorcorr_mult_rho_cluster_new.ado

8.For the measurement error we need to estimate the heritability of educational attainment. We use GCTA package (https://github.com/jianyangqt/gcta). The version used in 1.93.0beta downloaded in October 2022. We copy the program to $GCTA. 

8.1. Extract unrelated individuals: 
In $Plink in Windowns cmd  run

    plink --bfile ./AH_final/AH_EA_chr1_22 --rel-cutoff 0.025 --mind 0.05 --maf 0.01 --make-bed  --out ./AH_sub2_CH1_22

8.2. move AH_sub2_CH1_22.bed, .bim, and .fam from $Plink to $GCTA

8.3. Run STATA code that generates the phenotypes: 

    $do/MeasurementError/phenotype_prepare_GCTA.do

The .phen file is saved in $GCTA

Also the estimated $R^2$ for EA in AH European-ancestry sample can be obtained from there. 

   
8.4 Heritability estimation:

In Linux terminal in folder $GCTA run:

    ./gcta64 --bfile AH_sub2_CH1_22 --make-grm --out AH_EA_grm 

    ./gcta64 --reml --grm AH_EA_grm --pheno schoolingAH.phen --grm-cutoff 0.025 --out AH_EA_heritEA

    ./gcta64 --reml --grm AH_EA_grm --pheno schooling_resAH.phen --grm-cutoff 0.025 --out AH_EA_heritEA

    ./gcta64 --reml --grm AH_EA_grm --pheno pvtAH.phen --grm-cutoff 0.025 --out AH_EA_heritEA

    ./gcta64 --reml --grm AH_EA_grm --pheno pvt_resAH.phen --grm-cutoff 0.025 --out AH_EA_heritEA


9.Compute Genetic Correlations

Estimate kinship correlation in $Plink folder 

    king -b ./AH_final/AH_EA_chr1_22.bed --kinship  --prefix ./AH_final/AH_EA_kinship

Run Stata code $do/MeasurementError/kinship"


## Main Analysis

All output is produced in folder $table

10.Conduct the analysis using codes from $do/AnalysisMain (Stata):  

10.1. Prepares the final sample (merges parental polygenic scores)

    $do/AnalysisMain/DATAPREP_STAGE2.do

10.2. Main regression results (uses subcommand **$ado/merrorcorr_mult_rho_cluster_new.ado** for the measurement error correction. )

    $do/AnalysisMain/ANALYSIS_ALL_FINAL.do

    Output are Table 3, and Online Appendix Tables  H.7, H.8, H.9, H.10 
        
10.3. Some descriptive statistics and balancing tests

    $do/AnalysisMain/ANALYSIS_DESCRIPTIVE.do

    Output are Table 2, and Online Appendix Tables H.1, H.2, H.3, H.4, H.6, Figure H.1

10.4. External Validity Analysis 

    $do/AnalysisMain/ANALYSIS_EXTVAL.do

    Output is Table H.5

10.4. Placebo tests

    $do/AnalysisMain/PLACEBO.do

    Output is Figure H.2


## Theoretical Simulations


11.Conduct additional analysis for the Appendix (no data is needed for replication): 

11.1. Simulation for Appendix A (using jupyter notebooks or Python):
    
Figure A.1: 
    
    $do/TheoreticalApps/Pubgoods.ipynb

11.2. Simulation for Appendix C (using jupyter notebooks or python):
    
Figure E.1: 

    $do/TheoreticalApps/AppendixE.ipynb


## Measurement Error Correction Simulations   

11.3. Simulation Figures from Appendix F (Stata)

Figure F.1: 
    
    $do/MeasurementError/MAIN_SIM_BY_RHO.do

Figure F.2: 
    
    $do/MeasurementError/MAIN_SIM_BY_EFFECT.do

Figure F.3: 
    
    $do/MeasurementError/MAIN_SIM_BY_PAREFFECT.do


## ADDITIONAL INSTRUCTION FOR PYTHON ENVIRONMENT FOR SNIPar

1. Go to the directory where you want to work
2. Create the environment with the Python version needed, in our case, 3.7

    ```{bash}
    conda create -n snip-env python=3.7
    ```

the new environment is called snip-env

2. Activate the environment with

    ```{bash}
    conda activate snip-env
    ```

3. Intall packages

    ```{bash}
    conda install -c anaconda h5py==2.10.0
    conda install -c conda-forge bgen-reader==4.0.7
    conda install -c anaconda pandas
    conda install -c anaconda scipy
    pip install pysnptools
    conda install -c anaconda pandas
    conda install -c anaconda networkx
    conda install -c anaconda cython
    ```

